INTRODUCTION

HIV continues to devastate communities across Africa, especially in sub-Saharan regions, where the epidemic hits hardest. By tracking how HIV affects populations over time, we can better understand its spread, direct lifesaving resources where they’re needed most, and craft policies that make a real difference. But HIV isn’t just a health issue—it’s deeply tied to poverty, unemployment, and inequality, trapping many in cycles of hardship.

This analysis tackles two key challenges:

  1. HIV Trends and Poverty Where is HIV hitting hardest? I will explore trends in the countries that bear 75% of the global burden, both worldwide and within each WHO region.

How does poverty shape—and get shaped by—HIV? By combining WHO data with the World Bank’s multidimensional poverty index,I will uncover hidden connections using statistical modeling.

  1. Child Mortality in East Africa How are children under five, and newborns, faring in the eight East African Community (EAC) nations?

l will map the latest estimates to see where help is most urgent and track changes over time to spot which countries face the steepest struggles.

By digging into these questions, I aim to turn data into action—helping policymakers, healthcare workers, and communities fight back more effectively.

QUESTION 1
knitr::opts_chunk$set(
  echo = TRUE,
  warning = FALSE,
  message = FALSE,
  fig.width = 7,    
  fig.height = 5,   
  out.width = "100%",
  dpi = 300        
)
# Loading required libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(scales)
## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor
library(sf)
## Linking to GEOS 3.13.0, GDAL 3.10.1, PROJ 9.5.1; sf_use_s2() is TRUE
library(geodata)
## Warning: package 'geodata' was built under R version 4.4.3
## Loading required package: terra
## Warning: package 'terra' was built under R version 4.4.3
## terra 1.8.42
## 
## Attaching package: 'terra'
## 
## The following object is masked from 'package:scales':
## 
##     rescale
## 
## The following object is masked from 'package:tidyr':
## 
##     extract
library(tmap)
## Warning: package 'tmap' was built under R version 4.4.3
library(patchwork)
## Warning: package 'patchwork' was built under R version 4.4.3
## 
## Attaching package: 'patchwork'
## 
## The following object is masked from 'package:terra':
## 
##     area
1. HIV Trends and Poverty Analysis

1.1 Data Loading and Cleaning

# Loading HIV data 
hiv_raw <- read.csv(file.choose())

# Initial exploration
#head(hiv_data)
#tail(hiv_data)
#str(hiv_data)
#summary(hiv_data)
#colSums(is.na(hiv_data))

# Cleaning and extract numeric estimates and bounds
hiv_clean <- hiv_raw %>%
  # Removing both spaces and commas from Value
  mutate(
    val_stripped = str_remove_all(Value, "[, ]"),
    estimate    = as.numeric(str_extract(val_stripped, "^[0-9]+")),
    lower_bound = as.numeric(str_extract(val_stripped, "(?<=\\[)[0-9]+")),
    upper_bound = as.numeric(str_extract(val_stripped, "(?<=-)[0-9]+(?=\\])")),
    year        = as.integer(Period),
    region_code = ParentLocationCode  
  ) %>%
  select(
    country         = Location,
    region_code,
    year,
    estimate,
    lower_bound,
    upper_bound
  )
1.2 Countries Contributing 75% of Global HIV Burden
library(ggplot2)
library(scales)   
library(forcats)   

# Identifying top 75% burden countries
top_countries <- hiv_clean %>%
  group_by(country) %>%
  summarise(total = sum(estimate, na.rm = TRUE)) %>%
  arrange(desc(total)) %>%
  mutate(
    pct     = total / sum(total),
    cum_pct = cumsum(pct)
  ) %>%
  filter(cum_pct <= 0.75) %>%
  pull(country)

# Prepare plot data
plot_data <- hiv_clean %>%
  filter(country %in% top_countries) %>%
  mutate(country = fct_reorder(country, estimate, .fun = max, .desc = TRUE))


trend_plot <- ggplot(plot_data, aes(x = year, y = estimate, colour = country)) +
  geom_line(size = 1, na.rm = TRUE) +
  geom_point(size = 2, na.rm = TRUE) +
  scale_y_continuous(labels = comma) +
  scale_x_continuous(breaks = pretty_breaks()) +
  labs(
    title = "Trend of People Living with HIV (Top 75% Burden Countries)",
    x = "Year",
    y = "Number of People",
    colour = "Country"
  ) +
  theme_minimal() +
  theme(
    legend.position = "bottom",
    legend.text = element_text(size = 8),
    plot.title = element_text(size = 12)
  ) +
  guides(color = guide_legend(nrow = 3))


print(trend_plot)

# Saving as PNG 
ggsave(
  filename = "Top75_Burden_Countries_HIV_trend.png",  
  plot     = trend_plot,    
  width    = 10,             
  height   = 8,              
  dpi      = 300             
)

This graph highlights the countries carrying the heaviest HIV burden worldwide, with a clear focus on Sub-Saharan Africa, where the epidemic remains most severe. South Africa and Nigeria stand out with the highest numbers, reflecting both large populations and ongoing transmission challenges. Neighboring countries like Mozambique, Kenya, and Uganda also show significant caseloads, though at slightly lower levels. Outside Africa, Brazil and Thailand appear as key comparators—regions with concentrated epidemics but different healthcare responses. The data likely reflects the year 2020, meaning it captures early COVID-19 disruptions to testing and treatment programs. The graph underscores how HIV remains a critical public health issue, particularly in high-burden countries where prevention and care efforts must stay a priority.

1.3 Regional Trends within WHO Regions
library(cowplot)
library(dplyr)
# 4. Regional 75% contributors per WHO region
regional_contribs <- hiv_clean %>%
  group_by(region_code, country) %>%
  summarise(total_est = sum(estimate, na.rm = TRUE), .groups = "drop_last") %>%
  mutate(
    region_total = sum(total_est, na.rm = TRUE),
    contribution = total_est / region_total
  ) %>%
  arrange(region_code, desc(contribution)) %>%
  group_by(region_code) %>%
  mutate(cum_contrib = cumsum(contribution)) %>%
  filter(cum_contrib <= 0.75) %>%
  pull(country)

# Plot by region

plot_region <- hiv_clean %>%
  filter(country %in% regional_contribs) %>%
  ggplot(aes(x = year, y = estimate, color = country)) +
  geom_line(size = 1, na.rm = TRUE) +
  geom_point(size = 1.5, na.rm = TRUE) +
  facet_wrap(~ region_code, scales = "free_y", ncol = 3) +
  scale_y_continuous(labels = comma) +
  scale_x_continuous(breaks = pretty_breaks()) +
  labs(
    title = "HIV Trends for Top 75% Contributors by WHO Region",
    x = "Year", 
    y = "People Living with HIV",
    color = "Country"
  ) +
  theme_minimal() +
  theme(
    legend.position = "bottom",
    legend.box = "vertical",
    legend.margin = margin(t = -20), 
    legend.text = element_text(size = 6),
    legend.title = element_text(size = 8),
    legend.key.size = unit(0.3, "cm"),
    strip.text = element_text(size = 8),
    plot.margin = margin(0.5, 0.5, 0.5, 0.5, "cm"),
    panel.spacing = unit(1, "lines")
  ) +
  guides(
    color = guide_legend(
      ncol = 4, 
      title.position = "top",
      title.hjust = 0.5
    )
  )
print(plot_region)

cowplot::save_plot(
  filename = "HIV_top75_by_region.png",
  plot     = plot_region,
  base_width  = 10,
  base_height = 8,
  dpi         = 300
)

This image gives us a snapshot of how the number of people living with HIV has trended since the year 2000, broken down by different regions defined by the World Health Organization (WHO). Each of the six smaller graphs focuses on a specific region – AFR (Africa), AMR (Americas), EMR (Eastern Mediterranean), EUR (Europe), SEAR (South-East Asia), and WPR (Western Pacific). Within each regional graph, you see lines representing the countries that contribute to the top 75% of the total number of people living with HIV in that particular region. The y-axis shows the number of people living with HIV, while the x-axis tracks the years. By looking at the slopes of these lines, we can see whether the number of people living with HIV has been increasing, decreasing, or staying relatively stable in these key contributing countries within each WHO region over the past couple of decades.

1.4 Merging with Multidimensional Poverty Data
library(readxl)
library(dplyr)
library(janitor)
library(stringr)
library(tidyr)

# === 5. Load & clean multidimensional poverty data ===
poverty_raw <- read_excel(
  file.choose(),
  skip = 2,
  .name_repair = "minimal"
)

# Inspect the raw names before renaming
print(colnames(poverty_raw))
##  [1] ""                           ""                          
##  [3] ""                           ""                          
##  [5] ""                           ""                          
##  [7] ""                           ""                          
##  [9] ""                           "Monetary (%)"              
## [11] "Educational attainment (%)" "Educational enrollment (%)"
## [13] "Electricity (%)"            "Sanitation (%)"            
## [15] "Drinking water (%)"         ""
poverty_data <- poverty_raw %>%
  clean_names() %>%
  rename_with(~ c(
    "region", "country_code", "country",
    "reporting_year", "survey_name", "survey_year",
    "survey_coverage", "welfare_type", "survey_comparability",
    "monetary_poverty", "education_attainment",
    "education_enrollment", "electricity_access",
    "sanitation", "drinking_water", "multidimensional_poverty"
  )[seq_along(.)]) %>%
  mutate(
    reporting_year = as.integer(reporting_year),
    survey_year = as.integer(survey_year),
    across(
      monetary_poverty:multidimensional_poverty,
      ~ as.numeric(str_remove_all(as.character(.), "[^0-9\\.]"))
    )
  )
1.5 Exploring the Relationship (Mixed-Effects Model)
library(lme4)
library(tidyverse) # Loads dplyr, tidyr, etc.


# DATA CLEANING
# Clean HIV data
hiv_data <- hiv_raw %>%
  # Standardize column names
  clean_names() %>%
  # Rename key columns
  rename(
    estimate = value,
    country = location,
    year = period
  ) %>%
  # Changing data types
  mutate(
    # Extracting numbers and converting to numeric
    estimate = as.numeric(gsub("[^0-9.]", "", estimate)),
    # Converting year to integer
    year = as.integer(year)
  )

# Merging with poverty data
merged_df <- left_join(
  hiv_data, 
  poverty_data, 
  by = c("country", "year" = "reporting_year")
)


# DATA DIAGNOSTICS

# Filter complete cases
merged_df_clean <- merged_df %>%
  filter(
    !is.na(estimate), 
    !is.na(multidimensional_poverty)
  )

# Print data structure
cat("\nDATA STRUCTURE:\n")
## 
## DATA STRUCTURE:
cat("Total observations:", nrow(merged_df_clean), "\n")
## Total observations: 38
cat("Unique countries:", length(unique(merged_df_clean$country)), "\n")
## Unique countries: 38
cat("Unique years:", length(unique(merged_df_clean$year)), "\n")
## Unique years: 3
# Show sample data
cat("\nSAMPLE DATA:\n")
## 
## SAMPLE DATA:
head(merged_df_clean[, c("country", "year", "estimate", "multidimensional_poverty")])
##         country year     estimate multidimensional_poverty
## 1         Benin 2015 7.100061e+14                 45.44324
## 2 Cote d'Ivoire 2015 4.600004e+17                 29.17709
## 3 Guinea-Bissau 2010 3.700033e+14                 38.73005
## 4         Kenya 2015 1.500000e+20                 38.49010
## 5        Malawi 2010 9.300009e+17                 78.25200
## 6        Zambia 2010 9.400009e+18                 66.50606
# MODEL FITTING

test_model <- lmer(
  estimate ~ multidimensional_poverty + (1 | year),
  data = merged_df_clean
)


# MODEL OUTPUT

cat("\nMODEL SUMMARY:\n")
## 
## MODEL SUMMARY:
summary(test_model)
## Linear mixed model fit by REML ['lmerMod']
## Formula: estimate ~ multidimensional_poverty + (1 | year)
##    Data: merged_df_clean
## 
## REML criterion at convergence: 3322.2
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.1320 -0.0209  0.0059  0.0158  5.0475 
## 
## Random effects:
##  Groups   Name        Variance  Std.Dev. 
##  year     (Intercept) 5.821e+38 2.413e+19
##  Residual             4.416e+38 2.101e+19
## Number of obs: 38, groups:  year, 3
## 
## Fixed effects:
##                           Estimate Std. Error t value
## (Intercept)              1.415e+19  1.559e+19   0.908
## multidimensional_poverty 1.248e+17  1.998e+17   0.624
## 
## Correlation of Fixed Effects:
##             (Intr)
## mltdmnsnl_p -0.204
#Checking model assumptions
if (FALSE) 
  plot(test_model)           
  qqnorm(resid(test_model))  

The analysis found no clear evidence that poverty levels directly affect HIV rates across countries. While the numbers showed a slight tendency for higher HIV cases in poorer nations, this connection wasn’t strong enough to be statistically meaningful. The model also struggled because HIV case numbers varied extremely between countries - from very few to extremely high - making reliable comparisons difficult. Additionally, with only three years of data (2010-2015), the analysis couldn’t properly account for changes over time. These limitations mean we can’t confidently conclude that poverty causes higher HIV rates from this particular study. The results suggest we may need better data or different methods to properly examine this relationship.

QUESTION 2
1. LOAD AND CLEAN DATA
# Load data with explicit path
tmort <- read.csv(file.choose()) 

# Clean data
clean_data <- tmort %>%
  mutate(
    Geographic.area = case_when(
      grepl("Tanzania", Geographic.area, ignore.case = TRUE) ~ "Tanzania",
      grepl("DRC|Congo, Dem|Democratic Republic", Geographic.area) ~ "Democratic Republic of the Congo",
      TRUE ~ Geographic.area
    )
  ) %>%
  filter(
    Geographic.area %in% c(
      "Burundi", "Democratic Republic of the Congo",
      "Kenya", "Rwanda", "South Sudan",
      "Tanzania", "Uganda", "Somalia"
    ),
    Observation.Value >= 0
  )
2. LOAD AND PREPARE SHAPEFILES
library(sf)
library(geodata)
library(ggplot2)


shape_dir <- "shapefiles"
dir.create(shape_dir, showWarnings = FALSE)

# Downloading and saving boundaries
eac_shapes <- gadm(
  country = c("BDI", "COD", "KEN", "RWA", "SSD", "TZA", "UGA", "SOM"),
  level = 0,
  path = shape_dir
) %>% 
  st_as_sf() %>%
  rename(Geographic.area = COUNTRY)

# Ploting and saving
map_outline <- ggplot(eac_shapes) +
  geom_sf(fill = NA, color = "black", size = 0.4) +   
  theme_void()

ggsave(
  filename = "EAC_shapes_outline.png",
  plot = map_outline,
  width = 8,
  height = 6,
  dpi = 300
)
3. CREATING MAPS-STATIC
library(tmap)
library(sf)
library(dplyr)

# 1. Prepare data with explicit facet labels
map_data <- clean_data %>%
  mutate(Year = floor(Reference.Date)) %>%
  filter(Year >= 2000, Year <= 2023) %>%
  group_by(Geographic.area, Indicator) %>%
  filter(Year == max(Year)) %>%
  left_join(eac_shapes, by = "Geographic.area") %>%
  st_as_sf() %>%
  mutate(
    Facet_Label = paste(Indicator, "in", Year)  # Create combined label
  )

# 2. Create visualization with single facet dimension
mortality_map <- tm_shape(map_data) +
  tm_polygons(
    fill = "Observation.Value",
    fill.scale = tm_scale_continuous(
      values = "viridis",
      n = 6
    ),
    fill.legend = tm_legend(
      title = "Deaths per 1,000",
      breaks = c(20, 40, 60, 80, 100, 120)
    ),
    col = "white",
    lwd = 0.5
  ) +
  tm_facets(
    by = "Facet_Label", 
    ncol = 2,
    free.coords = FALSE
  ) +
  tm_title("Child Mortality in East Africa (2000-2023)") +
  tm_layout(
    legend.outside = TRUE,
    legend.outside.position = "right",
    panel.label.bg.color = "white"
  ) +
  tm_credits("Data Source: UN IGME\nLatest available year between 2000-2023",
            position = c("left", "bottom"))

# 3. Save and display
print(mortality_map)

tmap_save(mortality_map, "combined_mortality_map.png", 
         width = 14, height = 10, dpi = 300)

This Map shows a simple breakdown of child mortality data from 2000- 2023, split into two categories: newborns (neonatal) and children under five. The source, UNIQME, notes that the numbers reflect the most recent estimates between 2020 and 2023. Below the titles, there’s a simple checklist-style scale grouping mortality rates into seven ranges—from the lowest (0–20 deaths per 1,000 births) to the highest (120–140). The layout is minimal, with clear labels and empty check boxes, likely meant to represent tiers of severity or progress.

4.TREND ANALYSIS
library(tmap)
library(patchwork)


#data structure
str(clean_data)
## 'data.frame':    6255 obs. of  23 variables:
##  $ REF_AREA              : chr  "BDI" "BDI" "BDI" "BDI" ...
##  $ Geographic.area       : chr  "Burundi" "Burundi" "Burundi" "Burundi" ...
##  $ Regional.group        : chr  "" "" "" "" ...
##  $ Indicator             : chr  "Neonatal mortality rate" "Neonatal mortality rate" "Neonatal mortality rate" "Neonatal mortality rate" ...
##  $ Sex                   : chr  "Total" "Total" "Total" "Total" ...
##  $ Wealth.Quintile       : chr  "Total" "Total" "Total" "Total" ...
##  $ Series.Name           : chr  "Demographic and Health Survey 2016-2017 (Direct)" "Demographic and Health Survey 2016-2017 (Direct)" "Demographic and Health Survey 2016-2017 (Direct)" "Demographic and Health Survey 2016-2017 (Direct)" ...
##  $ Series.Year           : chr  "2016-2017" "2016-2017" "2016-2017" "2016-2017" ...
##  $ Reference.Date        : num  1994 1998 2004 2008 2014 ...
##  $ Observation.Value     : num  36.4 41.3 32.8 28.5 23.7 ...
##  $ Lower.Bound           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Upper.Bound           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Standard.Error        : num  8.28 5.77 3.87 4.42 2.98 ...
##  $ Country.notes         : chr  "" "" "" "" ...
##  $ Observation.Status    : chr  "Excluded from IGME" "Included in IGME" "Included in IGME" "Included in IGME" ...
##  $ Unit.of.measure       : chr  "Deaths per 1,000 live births" "Deaths per 1,000 live births" "Deaths per 1,000 live births" "Deaths per 1,000 live births" ...
##  $ Series.Type           : chr  "Direct" "Direct" "Direct" "Direct" ...
##  $ Series.Category       : chr  "DHS" "DHS" "DHS" "DHS" ...
##  $ Series.Method         : chr  "Survey/Census with Full Birth Histories" "Survey/Census with Full Birth Histories" "Survey/Census with Full Birth Histories" "Survey/Census with Full Birth Histories" ...
##  $ Age.Group.of.Women    : chr  "" "" "" "" ...
##  $ Time.Since.First.Birth: chr  "" "" "" "" ...
##  $ Definition            : logi  NA NA NA NA NA NA ...
##  $ Interval              : num  5 5 5 5 5 5 5 5 5 5 ...
summary(clean_data)
##    REF_AREA         Geographic.area    Regional.group      Indicator        
##  Length:6255        Length:6255        Length:6255        Length:6255       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##      Sex            Wealth.Quintile    Series.Name        Series.Year       
##  Length:6255        Length:6255        Length:6255        Length:6255       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  Reference.Date Observation.Value  Lower.Bound      Upper.Bound    
##  Min.   :1952   Min.   :  9.00    Min.   : 11.07   Min.   : 23.29  
##  1st Qu.:1988   1st Qu.: 67.42    1st Qu.: 45.14   1st Qu.: 70.19  
##  Median :1998   Median :126.40    Median : 96.65   Median :145.45  
##  Mean   :1997   Mean   :131.51    Mean   :108.71   Mean   :160.11  
##  3rd Qu.:2008   3rd Qu.:178.79    3rd Qu.:157.84   3rd Qu.:222.23  
##  Max.   :2024   Max.   :773.17    Max.   :634.89   Max.   :929.16  
##                                   NA's   :2593     NA's   :2593    
##  Standard.Error    Country.notes      Observation.Status Unit.of.measure   
##  Min.   :  1.472   Length:6255        Length:6255        Length:6255       
##  1st Qu.:  7.700   Class :character   Class :character   Class :character  
##  Median : 10.684   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 11.934                                                           
##  3rd Qu.: 14.300                                                           
##  Max.   :180.790                                                           
##  NA's   :4276                                                              
##  Series.Type        Series.Category    Series.Method      Age.Group.of.Women
##  Length:6255        Length:6255        Length:6255        Length:6255       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  Time.Since.First.Birth Definition        Interval    
##  Length:6255            Mode:logical   Min.   :1.000  
##  Class :character       NA's:6255      1st Qu.:1.000  
##  Mode  :character                      Median :1.000  
##                                        Mean   :1.757  
##                                        3rd Qu.:1.000  
##                                        Max.   :5.000  
##                                        NA's   :799
#indicators
unique(clean_data$Indicator)
## [1] "Neonatal mortality rate"   "Under-five mortality rate"
# date range
range(clean_data$Reference.Date, na.rm = TRUE)
## [1] 1951.5 2023.5
# Checking country coverage
unique(clean_data$Geographic.area)
## [1] "Burundi"                          "Democratic Republic of the Congo"
## [3] "Kenya"                            "Rwanda"                          
## [5] "Somalia"                          "South Sudan"                     
## [7] "Tanzania"                         "Uganda"
# Sample of mortality values
clean_data %>%
  group_by(Indicator) %>%
  summarise(
    min_value = min(Observation.Value, na.rm = TRUE),
    max_value = max(Observation.Value, na.rm = TRUE),
    avg_value = mean(Observation.Value, na.rm = TRUE)
  )
## # A tibble: 2 × 4
##   Indicator                 min_value max_value avg_value
##   <chr>                         <dbl>     <dbl>     <dbl>
## 1 Neonatal mortality rate        9         70.7      37.8
## 2 Under-five mortality rate      9.87     773.      143.
# Trend Analysis 

# Calculating annual averages for each indicator
trend_data <- clean_data %>%
  group_by(Reference.Date, Indicator) %>%
  summarise(
    avg_mortality = mean(Observation.Value, na.rm = TRUE),
    .groups = 'drop'
  ) %>%
  mutate(Reference.Date = as.numeric(Reference.Date)) 
5.TREND PLOTS
clean_data <- clean_data %>%
  mutate(Reference.Date = as.numeric(Reference.Date))  

# 1. Under-Five Mortality Trend Plot
under5_plot <- ggplot(clean_data %>% filter(Indicator == "Under-five mortality rate"), 
                      aes(x = Reference.Date, y = Observation.Value)) +
  geom_line(aes(group = Geographic.area, color = Geographic.area), 
            alpha = 0.5, linewidth = 0.7) +
  geom_point(aes(color = Geographic.area), alpha = 0.7, size = 1.5) +
  geom_smooth(aes(group = 1), method = "loess", color = "black", 
              se = FALSE, linewidth = 1.5) +
  scale_color_viridis_d() +
  labs(title = "Under-Five Mortality Trends by Country (1951-2023)",
       x = "Year",
       y = "Mortality Rate (per 1,000 live births)",
       color = "Country") +
  theme_minimal() +
  theme(legend.position = "bottom") +
  guides(color = guide_legend(nrow = 2))

# 2. Neonatal Mortality Trend Plot
neonatal_plot <- ggplot(clean_data %>% filter(Indicator == "Neonatal mortality rate"), 
                        aes(x = Reference.Date, y = Observation.Value)) +
  geom_line(aes(group = Geographic.area, color = Geographic.area), 
            alpha = 0.5, linewidth = 0.7) +
  geom_point(aes(color = Geographic.area), alpha = 0.7, size = 1.5) +
  geom_smooth(aes(group = 1), method = "loess", color = "black", 
              se = FALSE, linewidth = 1.5) +
  scale_color_viridis_d() +
  labs(title = "Neonatal Mortality Trends by Country (1951-2023)",
       x = "Year",
       y = "Mortality Rate (per 1,000 live births)",
       color = "Country") +
  theme_minimal() +
  theme(legend.position = "bottom") +
  guides(color = guide_legend(nrow = 2))

# Combine plots
combined_trends <- under5_plot / neonatal_plot +
  plot_annotation(title = "Child Mortality Trends in East African Community",
                  subtitle = "With country-specific trajectories and regional trend line (black)")

# Display the plot
print(combined_trends)

ggsave("trend_map_EAC.png", plot = combined_trends, width = 10, height = 8, dpi = 300)
under5_plot

neonatal_plot

The two trend lines illustrates child mortality trends within the East African Community from 1951 to 2023. The top graph focuses on under-five mortality rates (per 1,000 live births), while the bottom graph depicts neonatal mortality rates (per 1,000 live births) for individual countries: Burundi, Democratic Republic of the Congo, Kenya, Rwanda, Somalia, South Sudan, Tanzania, and Uganda.

In both graphs, each country’s trajectory is represented by a distinct colored line with circular markers, allowing for the visualization of country-specific changes over time. A thick black line overlaid on the country-specific data represents the regional trend line, providing an overall view of mortality rate changes across the East African Community.

Both under-five and neonatal mortality rates show a general downward trend across the region over the period, indicating progress in child survival. However, the graphs also highlight variations in mortality rates and the pace of decline among the different countries within the community. Notably, there appear to be periods of fluctuation and divergence in the country-specific trajectories, suggesting diverse factors influencing child mortality in each nation. The regional trend line smooths out these individual variations to present a broader picture of improvement in child survival within the East African Community over the decades.

6.HIGHEST MORTALITY COUNTRIES IN EAC
# Identify Highest Mortality Countries 
highest_mortality <- clean_data %>%
  filter(Reference.Date == max(Reference.Date)) %>%
  group_by(Indicator) %>%
  slice_max(Observation.Value, n = 1) %>%
  select(Geographic.area, Indicator, Observation.Value)

# Print results
cat("\nCountries with Highest Mortality Rates:\n")
## 
## Countries with Highest Mortality Rates:
print(highest_mortality)
## # A tibble: 2 × 3
## # Groups:   Indicator [2]
##   Geographic.area Indicator                 Observation.Value
##   <chr>           <chr>                                 <dbl>
## 1 South Sudan     Neonatal mortality rate                40.2
## 2 Somalia         Under-five mortality rate             123.

The data reveals critical disparities in child health outcomes across East Africa, with South Sudan reporting a neonatal mortality rate of 40.2 deaths per 1,000 live births—reflecting challenges in maternal and newborn care—while Somalia’s under-five mortality rate soars to 122.7 deaths per 1,000, underscoring the compounded vulnerabilities of conflict, poverty, and limited healthcare access. These figures highlight the urgent need for targeted interventions, particularly in fragile states where systemic barriers perpetuate high child mortality. The stark contrast between neonatal and under-five rates suggests that survival gaps widen significantly in early childhood, demanding prioritized investments in nutrition, immunization, and infectious disease control.